Datasets for projects
Dataset collections
Kaggle competitions (requires registration)
Reddit submission corpus
All Reddit submissions (without the discussion threads) from 2006 through August 2015
40 GB compressed
- Description and data
- Full corpus: http://reddit-data.s3.amazonaws.com/RS_full_corpus.bz2
- Example: Monthly corpus: http://reddit-data.s3.amazonaws.com/RS_2015-08.bz2
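The monthly dumps are commonly distributed as one JSON object per line, bz2-compressed; assuming that layout, the file can be streamed without decompressing it to disk first. A minimal sketch (the sample file and its fields are synthetic, for illustration only):

```python
import bz2
import json
import os
import tempfile

def iter_submissions(path):
    """Stream submissions from a bz2 file holding one JSON object per line
    (the layout commonly used for these dumps -- an assumption here)."""
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Tiny synthetic file so the sketch is self-contained; a real monthly
# corpus is read the same way, just much larger.
tmp = tempfile.NamedTemporaryFile(suffix=".bz2", delete=False)
tmp.close()
with bz2.open(tmp.name, mode="wt", encoding="utf-8") as f:
    f.write('{"subreddit": "python", "title": "Hello"}\n')
    f.write('{"subreddit": "datasets", "title": "World"}\n')
posts = list(iter_submissions(tmp.name))
os.unlink(tmp.name)
```

Streaming matters here: the full corpus is 40 GB compressed, so loading it whole is not an option.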
Reddit top 2.5 million
This is a dataset of top posts from reddit. It contains the top 1,000 all-time posts from the top 2,500 subreddits, so 2.5 million posts in total. The top subreddits were determined by subscriber count and are located in the manifest file within.
This data was pulled between August 15 and 20, 2013.
Stack Exchange (= Stack Overflow, Server Fault, Super User, ...)
This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks.
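Each XML file in a site archive is a flat list of attribute-only &lt;row&gt; elements, so a streaming parse keeps memory use flat even for the Stack Overflow dump. A sketch in Python, run on a tiny inline fragment in that layout:

```python
import io
import xml.etree.ElementTree as ET

def iter_rows(xml_file):
    """Stream <row .../> elements from a Stack Exchange dump file
    without loading the whole document into memory."""
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # free the element as soon as it is consumed

# A two-row fragment shaped like Posts.xml (attributes are illustrative).
sample = io.StringIO(
    '<posts>'
    '<row Id="1" PostTypeId="1" Score="42" Title="How do I parse XML?" />'
    '<row Id="2" PostTypeId="2" ParentId="1" Score="17" />'
    '</posts>'
)
rows = list(iter_rows(sample))
```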
World-Wide Web
Common Crawl is a nonprofit organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of 145 TB of data from 1.81 billion webpages as of August 2015. It completes four crawls a year. (Source: Wikipedia)
Available in AWS S3 at: s3://aws-publicdatasets/common-crawl/
- To browse data in S3 using Cyberduck:
- Click Open Connection
- From drop-down select S3
- Server: aws-publicdatasets.s3.amazonaws.com
- Check anonymous login
- Click Connect
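The Cyberduck steps can also be done programmatically: since the bucket allows anonymous access, objects under it are typically reachable over plain HTTPS as well. A small helper for turning an S3 key into such a URL (the key below is just the common-crawl/ prefix from the path above; deeper keys can be found by browsing):

```python
BUCKET_HOST = "aws-publicdatasets.s3.amazonaws.com"

def public_url(key):
    """HTTPS URL for an object in the public aws-publicdatasets bucket
    (assumes the bucket serves anonymous HTTPS requests)."""
    return "https://{}/{}".format(BUCKET_HOST, key.lstrip("/"))

url = public_url("common-crawl/")
```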
Wikipedia
Wikipedia offers free copies of all available content to interested users.
Audioscrobbler dataset
Audioscrobbler, which has since merged with last.fm, once published a database of what music people listened to via the Audioscrobbler plugin. Last.fm no longer publishes it; however, the initial releases were in the public domain, so I can offer it for download.
135 MB compressed, 500 MB uncompressed
- Data (last.fm 2005 snapshot)
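The snapshot is commonly mirrored as whitespace-separated `userID artistID playcount` records; assuming that layout (the IDs below are made up), a minimal parser:

```python
def parse_plays(lines):
    """Parse 'userID artistID playcount' triples (assumed layout) as ints,
    skipping malformed lines."""
    triples = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            user, artist, count = (int(p) for p in parts)
            triples.append((user, artist, count))
    return triples

# Hypothetical records in the assumed three-column layout.
plays = parse_plays(["1000002 1 55", "1000002 1000006 33"])
```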
Million song dataset
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. The dataset does not include any audio, only the derived features. Note, however, that sample audio can be fetched from services like 7digital.
The metadata of a track includes information such as:
- artist
- duration
- release (album name)
- title
- year
The features derived from the audio track include:
- bars
- beats
- danceability
- energy
- key
- loudness
- sections
- segments
- song_hotttnesss
- tatums
- tempo
- time_signature
Covertype dataset
Predicting forest cover type from cartographic variables only (no remotely sensed data)
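Each record in the UCI covtype.data file is a comma-separated row of 54 integer cartographic features followed by the cover-type label (1-7). A minimal parser, demonstrated on a synthetic row of the same shape:

```python
def parse_covtype_line(line):
    """Split one covtype.data record: 54 integer cartographic features
    followed by the cover-type label (1-7)."""
    values = [int(v) for v in line.strip().split(",")]
    return values[:-1], values[-1]

# Synthetic record: 54 zero features and label 5 (real rows have this shape).
demo = ",".join(["0"] * 54 + ["5"])
features, label = parse_covtype_line(demo)
```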
NCDC Weather data
Worldwide surface weather observations from over 20,000 stations. Hourly measurements. Parameters included are: air quality, atmospheric pressure, atmospheric temperature/dew point, atmospheric winds, clouds, precipitation, ocean waves, tides and more.
For example, the data for year 2012:
Directory ftp://ftp3.ncdc.noaa.gov/pub/data/noaa/2012
File 010010-99999-2012.gz
File 010014-99999-2012.gz
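The station files are gzip-compressed, fixed-width ISD records. Assuming the layout in which the air temperature occupies characters 88-92 (1-based; signed, in tenths of a degree Celsius) with its quality code at character 93, a parsing sketch run on a synthetic record:

```python
def parse_temperature(record):
    """Pull the air temperature out of a fixed-width ISD record.
    Assumes characters 88-92 (1-based) hold the signed temperature in
    tenths of a degree Celsius, character 93 its quality code, and that
    +9999 marks a missing reading."""
    raw = record[87:92]
    quality = record[92]
    if raw == "+9999":
        return None, quality
    return int(raw) / 10.0, quality

# Synthetic record padded so the temperature field lands at the assumed
# offsets; real files such as 010010-99999-2012.gz are gzip-compressed text.
sample = "0" * 87 + "+0123" + "1" + "0" * 10
temp, quality = parse_temperature(sample)
```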
Network traffic anomaly detection
Raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. They operated the LAN as if it were a true Air Force environment, but peppered it with multiple attacks.
This dataset was used for the KDD-99 competition.
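Alongside the raw TCP dumps, the KDD-99 data is usually handled as comma-separated connection records: 41 features followed by a label with a trailing dot (e.g. "normal."). A minimal parser (the record below is synthetic, built only to have the right shape):

```python
def parse_connection(line):
    """Split one KDD-99 record: 41 connection features followed by a
    label such as 'normal.' or 'smurf.' (labels carry a trailing dot)."""
    fields = line.strip().split(",")
    return fields[:-1], fields[-1].rstrip(".")

# Synthetic record with 41 dummy features plus a label.
features, label = parse_connection(",".join(["0"] * 41) + ",normal.")
```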
Enron email dataset
The Enron Corpus is a large database of over 600,000 emails generated by 158 employees of the Enron Corporation. The data was originally collected at Enron headquarters in Houston during two weeks in May 2002 by a litigation support and data analysis contractor, to preserve the vast amounts of data in the wake of the Enron bankruptcy of December 2001.
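The corpus is commonly distributed as plain-text RFC 822 messages in maildir-style folders, which Python's standard email module parses directly. A sketch on a made-up message in that format:

```python
from email import message_from_string

# A made-up message in the plain-text RFC 822 format the corpus uses.
raw = (
    "Message-ID: <123.example>\n"
    "From: alice@enron.com\n"
    "To: bob@enron.com\n"
    "Subject: Q3 numbers\n"
    "\n"
    "Please send the latest figures.\n"
)
msg = message_from_string(raw)
sender = msg["From"]      # header lookup is case-insensitive
subject = msg["Subject"]
body = msg.get_payload()  # the text after the blank line
```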
Medline citation index
NY city taxi data
Data of taxi trips in NYC (GPS data)
List of trips. For each trip:
- taxi identification
- driver identification
- start and end time of the trip
- GPS coordinates of pick-up and drop-off
- fare
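Given pick-up and drop-off GPS coordinates, a natural derived feature is the straight-line trip distance via the haversine formula. A self-contained sketch (the coordinates below are illustrative: roughly Times Square and JFK airport):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points,
    e.g. a trip's pick-up and drop-off coordinates."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius

# Times Square to JFK, roughly 21 km as the crow flies.
d = haversine_km(40.7580, -73.9855, 40.6413, -73.7781)
```

Comparing this straight-line distance with the fare gives a quick sanity check on the GPS data.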